DTD-driven bilingual document generation
Abstract
Extensively annotated bilingual parallel corpora can be exploited to feed editing tools that integrate the processes of document composition and translation. Here we discuss the architecture of an interactive editing tool that, on top of techniques common to most Translation Memory-based systems, applies the potential of SGML's DTDs to guide the process of bilingual document generation. Rather than employing just simple task-oriented mark-up, we selected a set of tags from TEI's highly complex and versatile collection to help disclose the underlying logical structure of documents in the test corpus. DTDs were automatically induced and later integrated in the editing tool to provide the basic scheme for new documents.
1 Introduction
This paper discusses an approach to the architecture of an experimental interactive editing tool that integrates the processes of source document composition and translation into the target language. The tool has been conceived as an optimal solution for a particular case of bilingual production of legal documentation, but it also illustrates in a more general way how to exploit the possibilities of SGML (ISO8879, 1986) when it is used extensively to annotate a whole range of linguistic and extralinguistic information in specialized bilingual corpora. SGML is well established as the coding scheme underlying most Translation Memory based systems (TMBS), and has been proposed as the coding scheme for the interchange of existing Translation Memory databases: Translation Memories eXchange, TMX (Melby, 1998). The advantages of SGML have also been perceived by a large community of corpus linguistics researchers, and considerable effort has gone into developing suitable markup options to encode a variety of textual types and functions, as clearly demonstrated by the Text Encoding Initiative, TEI (Burnard & Sperberg-McQueen, 1995).
While the tag-sets employed by TMBS are simple and task-oriented, TEI has offered a highly complex and versatile collection of tags. The guiding hypothesis in our experiment has been the idea that it is possible to explore TEI/SGML markup in order to develop a system that carries the concept of Translation Memory one step further. One important feature of SGML is the DTD. DTDs determine the logical structure of documents and how to tag them accordingly. We have concentrated on the accurate description of documents by means of TEI-conformant SGML markup. The markup helps disclose the underlying logical structure of documents. From annotated documentation, DTDs can be induced, and these DTDs provide the basic scheme to produce new documents.
We have collected a corpus of official publications from three main institutions in the Basque Autonomous Region in Spain: the Boletín Oficial de Bizkaia (BOB, 1990-1995), the Boletín Oficial de Alava (BOA, 1990-1994) and the Boletín Oficial del País Vasco (BOPV, 1995). Documents in the corpus were composed by Administration clerks and translated by translators. Both clerks and translators have been using a wide variety of word processors, although since 1994 MSWord has been generalized as the standard editing tool.
Administrative documentation shows a regular structure, and is rich in recurrent textual patterns. For each document type, different document tokens share a common global distribution of elements. Official document composers learn these global structures and apply them consistently. It is also the case that composers tend to reuse old
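The idea of inducing a DTD from annotated documents can be illustrated with a minimal sketch. It is not the induction procedure used in the paper, which this excerpt does not describe. It assumes small, well-formed samples that an XML parser can read (a stand-in for full SGML), no mixed content, and a consistent relative order of children under each parent. The function name `induce_dtd`, the `SAMPLES` data and the choice of tags are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Hypothetical tagged samples; the tag names are illustrative only.
SAMPLES = [
    "<div><head>Order 12</head><p>First clause.</p><p>Second clause.</p></div>",
    "<div><head>Order 13</head><p>Only clause.</p><closer>Bilbao</closer></div>",
]


def induce_dtd(documents):
    """Return naive <!ELEMENT ...> declarations for the tagged samples.

    Assumption: children of a given parent always appear in the same
    relative order, so first-appearance order is used as the sequence.
    """
    instances = {}    # element name -> one Counter of child tags per occurrence
    child_order = {}  # element name -> child tags in order of first appearance
    for text in documents:
        for element in ET.fromstring(text).iter():
            counts = Counter(child.tag for child in element)
            instances.setdefault(element.tag, []).append(counts)
            seen = child_order.setdefault(element.tag, [])
            for child in element:
                if child.tag not in seen:
                    seen.append(child.tag)

    suffixes = {(False, False): "", (True, False): "?",
                (False, True): "+", (True, True): "*"}
    declarations = []
    for name, children in child_order.items():
        if not children:
            # Elements never seen with children are treated as text-only.
            declarations.append(f"<!ELEMENT {name} (#PCDATA)>")
            continue
        parts = []
        for tag in children:
            optional = any(c[tag] == 0 for c in instances[name])
            repeated = any(c[tag] > 1 for c in instances[name])
            parts.append(tag + suffixes[(optional, repeated)])
        declarations.append(f"<!ELEMENT {name} ({', '.join(parts)})>")
    return declarations


for declaration in induce_dtd(SAMPLES):
    print(declaration)
```

For the two samples, the first declaration printed is `<!ELEMENT div (head, p+, closer?)>`: `head` occurs exactly once in every `div`, `p` repeats in at least one, and `closer` is missing from at least one. A realistic induction step would also have to handle mixed content, alternative orderings and attributes, which this sketch ignores.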
Publication date: 2000